Learning Classifiers for Assigning Protein Sequences to Gene Ontology Functional Families

Authors

  • Carson Andorf
  • Adrian Silvescu
  • Drena Dobbs
  • Vasant Honavar
Abstract

Assigning putative functions to novel proteins and discovering sequence correlates of protein function are important challenges in bioinformatics. In this paper, we explore several machine learning approaches to the data-driven construction of classifiers that assign protein sequences to the appropriate Gene Ontology (GO) functional families using a class-conditional probabilistic representation of amino acid sequences. Specifically, we represent protein sequences using the class-conditional probability distribution of amino acids (amino acid composition) or of short (k-letter) subsequences (k-grams) of amino acids. We compare a model (NB k-grams) that ignores the statistical dependencies among overlapping k-grams with an alternative, NB(k), that uses an undirected probabilistic graphical model to capture the relevant dependencies. Both methods require only one pass through the training data during the learning phase, which makes them especially attractive in settings where the classifiers must be updated as new training data become available. We also explore a support vector machine (SVM) classifier, SVM k-grams, trained on the k-gram class-conditional probability distributions of sequences. We report the performance of the resulting classifiers on three data sets of functional families from the Gene Ontology (GO) database. Our results show that the NB(k) classifier outperforms NB k-grams in classification accuracy (as measured by cross-validation) by a few percentage points, and that SVM k-grams outperforms NB(k) in the majority of test cases. These results suggest the possibility of developing fully automated and computationally efficient approaches to constructing classifiers based on undirected graphical models of overlapping k-grams that can be easily updated as additional training data become available. Our results also show that further gains in accuracy are achievable using SVM k-grams, at the expense of increased computational demands and hence greater difficulty of frequently updating the classifier as new training data become available.
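As a rough illustration of the NB k-grams idea described in the abstract, the sketch below counts overlapping k-grams per class in a single pass and scores a sequence by treating its k-grams as independent given the class. It is not the authors' implementation: the class name `NaiveBayesKGrams`, the Laplace smoothing constant `alpha`, the 20-letter alphabet used for the vocabulary size, and the toy sequences and family labels are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

# Assumption: the standard 20-letter amino acid alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kgrams(seq, k):
    """Return all overlapping k-letter windows of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class NaiveBayesKGrams:
    """Hypothetical sketch: naive Bayes over overlapping k-grams."""

    def __init__(self, k=2, alpha=1.0):
        self.k = k
        self.alpha = alpha                        # Laplace smoothing (assumed)
        self.class_counts = Counter()             # sequences seen per class
        self.kgram_counts = defaultdict(Counter)  # k-gram counts per class
        self.kgram_totals = Counter()             # total k-grams per class

    def update(self, seq, label):
        """Fold one labeled sequence into the counts (one pass, incremental)."""
        grams = kgrams(seq, self.k)
        self.class_counts[label] += 1
        self.kgram_counts[label].update(grams)
        self.kgram_totals[label] += len(grams)

    def log_score(self, seq, label):
        """Log prior plus the sum of smoothed log k-gram likelihoods."""
        vocab = len(AMINO_ACIDS) ** self.k
        n_seqs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / n_seqs)
        denom = self.kgram_totals[label] + self.alpha * vocab
        for g in kgrams(seq, self.k):
            score += math.log((self.kgram_counts[label][g] + self.alpha) / denom)
        return score

    def predict(self, seq):
        """Return the class with the highest log score."""
        return max(self.class_counts, key=lambda c: self.log_score(seq, c))


# Toy data; sequences and family names are made up for illustration.
model = NaiveBayesKGrams(k=2)
model.update("AAAAAAGAAA", "family_a")
model.update("KKKKRKKKKK", "family_b")
model.update("AAGAAAAAAA", "family_a")  # later data only increments counts
print(model.predict("AAAGAA"))
```

Because training only increments counters, adding a new labeled sequence never requires revisiting earlier data, which is the update property the abstract highlights. This sketch covers only the independence-assuming NB k-grams variant; NB(k) and SVM k-grams are not shown.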


Similar resources

Prediction of Protein Sub-Mitochondria Locations Using Protein Interaction Networks

Background: Predicting protein localization is among the most important problems in bioinformatics; it is used to determine the location of proteins within cells and organelles such as mitochondria. In this study, several machine learning algorithms are applied to predict intracellular protein locations. These algorithms use the features extracted from pro...


Data-Driven Generation of Decision Trees for Motif-Based Assignment of Protein Sequences to Functional Families

This paper describes an approach to data-driven discovery of sequence motif-based models in the form of decision trees for assigning protein sequences to functional families. Unlike approaches that try to classify protein sequences based on presence of a single motif, this method is able to capture regularities that can be described in terms of presence or absence of arbitrary combinations of m...


Automated data-driven discovery of motif-based protein function classifiers

Xiangyun Wang, Diane Schroeder, Drena Dobbs, and Vasant Honavar. Artificial Intelligence Laboratory, Department of Computer Science and Graduate Program in Bioinformatics and Computational Biology, Iowa State University, Ames, IA 50011, USA. www.cs.iastate.edu/~honavar/aigroup.html. Abstract: This paper ...


GENERATING FUZZY RULES FOR PROTEIN CLASSIFICATION

This paper considers the generation of interpretable fuzzy rules for assigning an amino acid sequence to the appropriate protein superfamily. Since the main objective of this classifier is the interpretability of its rules, we have used the distribution of amino acids in the protein sequences as features. These features are the occurrence probabilities of six exchange groups in the seque...



Publication date: 2004